19,549 research outputs found

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date there has been no rigorous study of exactly how cleaning affects ML: the ML community usually focuses on developing algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning in isolation, without considering how data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations, and we put forward multiple research directions for researchers. Comment: published in ICDE 202
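
The Benjamini-Yekutieli procedure named in the abstract above can be sketched in a few lines. This is a minimal, generic illustration of the step-up rule, not the CleanML authors' code; the function name `benjamini_yekutieli` and the sample p-values are assumptions made for the example.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Generic BY step-up rule (hypothetical helper, not taken from CleanML):
    reject the k smallest p-values, where k is the largest rank with
    p_(k) <= k * alpha / (m * c(m)) and c(m) = 1 + 1/2 + ... + 1/m.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction that makes the bound valid under arbitrary dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value falls under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Illustrative p-values only; they are not results from the study.
example = benjamini_yekutieli([0.6, 0.001, 0.041, 0.008, 0.039], alpha=0.05)
```

The harmonic factor c(m) makes BY more conservative than the Benjamini-Hochberg rule, in exchange for remaining valid when the tests are dependent.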

    Longitudinal trends in prostate cancer incidence, mortality, and survival of patients from two Shanghai city districts: a retrospective population-based cohort study, 2000-2009.

    Background: Prostate cancer is the fifth most common cancer affecting men of all ages in China, but robust surveillance data on its occurrence and outcome are lacking. The specific objective of this retrospective study was to analyze the longitudinal trends of prostate cancer incidence, mortality, and survival in Shanghai from 2000 to 2009. Methods: A retrospective population-based cohort study was performed using data from a central district (Putuo) and a suburban district (Jiading) of Shanghai. Records of all prostate cancer cases reported to the Shanghai Cancer Registry from 2000 to 2009 for the two districts were reviewed. Prostate cancer outcomes were ascertained by matching cases with individual mortality data (up to 2010) from the National Death Register. The Cox proportional hazards model was used to analyze factors associated with prostate cancer survival. Results: A total of 1022 prostate cancer cases were diagnosed from 2000 to 2009. The average age of patients was 75 years. A rapid increase in incidence occurred during the study period. Compared with the year 2000, 2009 incidence was 3.28 times higher in Putuo and 5.33 times higher in Jiading. Prostate cancer mortality declined from 4.45 per 100,000 individuals per year in 2000 to 1.94 per 100,000 in 2009 in Putuo and from 5.45 per 100,000 to 3.5 per 100,000 in Jiading during the same period. One-year and 5-year prostate cancer survival rates were 95% and 56% in Putuo, and 88% and 51% in Jiading, respectively. Staging of disease, Karnofsky Performance Scale Index, and selection of chemotherapy were three independent factors influencing the survival of prostate cancer patients. Conclusions: The prostate cancer incidence increased rapidly from 2000 to 2009, and prostate cancer survival rates decreased in urban and suburban Chinese populations. Early detection and prompt prostate cancer treatment are important for improving health and for increasing survival rates of the Shanghai male population.

    A Fully Polynomial Time Approximation Scheme for the Replenishment Storage Problem

    The Replenishment Storage Problem (RSP) is to minimize the storage capacity requirement for a deterministic-demand, multi-item inventory system in which each item has a given reorder size and cycle length. Reorders can take place only at integer time units within the cycle. This problem was shown to be weakly NP-hard for constant joint cycle length (the least common multiple of the lengths of all individual cycles). When all items have the same constant cycle length, there exists a Fully Polynomial Time Approximation Scheme (FPTAS), but no FPTAS has been known for the case in which the individual cycles differ. Here we devise the first known FPTAS for the RSP with different individual cycles and constant joint cycle length.
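
To make the objective concrete, the sketch below evaluates the peak storage for a given choice of reorder offsets and finds the best offsets by exhaustive search on a tiny instance. It is a brute-force illustration, not the FPTAS from the paper. It assumes constant demand, so that each item's inventory falls linearly from its reorder size to zero over one cycle; the names `peak_storage` and `best_offsets` and the sample instance are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import lcm  # requires Python 3.9+


def peak_storage(items, offsets):
    """Peak total inventory over one joint cycle.

    items:   list of (reorder_size, cycle_length) pairs
    offsets: integer reorder time of each item within its own cycle
    Assumes linear depletion, so the peak occurs at an integer time
    (total inventory only jumps upward at reorder instants).
    """
    joint = lcm(*(cycle for _, cycle in items))
    peak = Fraction(0)
    for t in range(joint):
        total = Fraction(0)
        for (size, cycle), off in zip(items, offsets):
            elapsed = (t - off) % cycle  # time since this item's last reorder
            total += Fraction(size) * (cycle - elapsed) / cycle
        peak = max(peak, total)
    return peak


def best_offsets(items):
    """Exhaustive search over all integer offsets (exponential; tiny inputs only)."""
    choices = product(*(range(cycle) for _, cycle in items))
    return min((peak_storage(items, o), o) for o in choices)


# Two identical items: staggering the reorders lowers the peak from 8 to 6.
demo_items = [(4, 2), (4, 2)]
demo_peak, demo_offsets = best_offsets(demo_items)
```

The search space grows as the product of the cycle lengths, which is why an approximation scheme is of interest for larger instances.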